Parallel Classification for Data Mining on Shared-Memory Multiprocessors
Abstract
We present parallel algorithms for building decision-tree classifiers on shared-memory multiprocessor (SMP) systems. The proposed algorithms span the gamut of data and task parallelism. The data parallelism is based on attribute scheduling among processors. This basic scheme is extended with task pipelining and dynamic load balancing to yield faster implementations. The task parallel approach uses dynamic subtree partitioning among processors. We evaluate the performance of these algorithms on two machine configurations: one in which data is too large to fit in memory and must be paged from a local disk as needed and the other in which memory is sufficiently large to cache the whole data. This performance evaluation shows that the construction of a decision-tree classifier can be effectively parallelized on an SMP machine with good speedup. For the local disk configuration, the speedup ranged from 2.97 to 3.86 for the build phase and from 2.20 to 3.67 for the total time on a 4-processor SMP. For the large memory configuration, the range of speedup was from 5.36 to 6.67 for the build phase and from 3.07 to 5.98 for the total time on an 8-processor SMP.
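As a rough illustration of the attribute-scheduling idea mentioned in the abstract, the sketch below hands each attribute of a tree node to a worker thread, which scans that attribute for its best split point; the node's split is then the best result across attributes. This is not the paper's implementation. The Gini criterion, the midpoint thresholds, and the names `gini`, `best_split_for_attribute`, and `best_split` are all assumptions made for the example, and Python threads only show the scheduling structure; they do not reproduce the reported speedups.

```python
from concurrent.futures import ThreadPoolExecutor


def gini(counts):
    """Gini impurity of a {class_label: count} dictionary."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def best_split_for_attribute(rows, labels, attr):
    """Scan one numeric attribute; return (weighted_gini, attr, threshold)."""
    n = len(rows)
    order = sorted(range(n), key=lambda i: rows[i][attr])
    left = {}
    right = {}
    for y in labels:
        right[y] = right.get(y, 0) + 1  # all records start on the right
    best = (float("inf"), attr, None)
    for pos in range(n - 1):
        i = order[pos]
        y = labels[i]
        left[y] = left.get(y, 0) + 1   # move record i to the left side
        right[y] -= 1
        lo, hi = rows[i][attr], rows[order[pos + 1]][attr]
        if lo == hi:
            continue  # equal values cannot be separated by a threshold
        n_left = pos + 1
        score = (n_left * gini(left) + (n - n_left) * gini(right)) / n
        if score < best[0]:
            best = (score, attr, (lo + hi) / 2)
    return best


def best_split(rows, labels, num_workers=4):
    """Evaluate attributes in parallel; idle workers pick up the next attribute."""
    num_attrs = len(rows[0])
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = pool.map(
            lambda a: best_split_for_attribute(rows, labels, a),
            range(num_attrs),
        )
        return min(results, key=lambda r: r[0])
```

Calling `best_split(rows, labels)` on a list of numeric tuples and a matching list of class labels returns the lowest weighted impurity together with the attribute index and threshold that achieve it. Because the pool assigns attributes to whichever worker is free, an attribute that is expensive to scan does not stall the others, which is the intuition behind scheduling attributes dynamically rather than statically.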
Similar resources
SPRINT: A Scalable Parallel Classifier for Data Mining
Classification is an important data mining problem. Although classification is a well-studied problem, most of the current classification algorithms require that all or a portion of the entire dataset remain permanently in memory. This limits their suitability for mining over large databases. We present a new decision-tree-based classification algorithm, called SPRINT, that removes all of the memo...
Towards a Cost-Effective Parallel Data Mining Approach
Massive rule induction has recently emerged as one of the powerful data mining techniques. The problem is known to be exponential in the size of the attributes, and given its ever increasing use, can greatly benefit from parallelization. In this paper, we study cost-effective approaches to parallelize rule generation algorithms. In particular, we consider the propositional rule generation algor...
Scalable Data Mining for Rules
Data Mining is the process of automatic extraction of novel, useful, and understandable patterns in very large databases. High-performance scalable and parallel computing is crucial for ensuring system scalability and interactivity as datasets grow inexorably in size and complexity. This thesis deals with both the algorithmic and systems aspects of scalable and parallel data mining algorithms a...
Decision Trees on Parallel Processors
A framework for induction of decision trees suitable for implementation on shared- and distributed-memory multiprocessors or networks of workstations is described. The approach, called Parallel Decision Trees (PDT), overcomes limitations of equivalent serial algorithms that have been reported by several researchers, and enables the use of the very-large-scale training sets that are increasingly...